This effort sought to refine and simplify techniques for generating acoustical signals that could be
used  in  three-dimensional  (3-D)  auditory  displays.  Such  signals  are  presented  to  a  listener  over  headphones 
and  create  the  illusion  of  a  virtual  sound  source  at  a  predetermined  position  in  3-D  space.  The  signals  are 
generated digitally, using algorithms based on the acoustical effects of human outer ear structures on sound
waves  reaching  the  ears.  To  date,  the  main  area  of  difficulty  inhibiting  development  of  practical  3-D  displays 
is  in  obtaining  estimates  of  these  outer  ear  effects.  The  focus  of  this  effort  was  in  this  area. 


The work was divided into three areas: 1) acoustical measurements of free-field to eardrum transfer
functions (also called head-related transfer functions, or HRTFs); 2) analysis of HRTFs; and 3)
psychophysical  assessment  of  human  performance  in  sound  localization  tasks  involving  stimuli  presented 
both in real and in simulated (virtual) auditory space. The focus in all three areas was on evaluation of
means for making HRTF measurements faster and easier, thus simplifying synthesis of auditory stimuli


In  the  measurement  area,  HRTFs  were  obtained  from  20  human  subjects  at  144  positions  in  an 
anechoic  sound  field.  A  periodic  pseudorandom  noise  averaging  technique  (Wightman  and  Kistler,  1989a) 
was  used  to  make  the  measurements.  Comparable  HRTF  measurements  were  also  obtained  from  a  KEMAR 
mannequin  (using  the  same  pseudorandom  noise  procedure)  and  from  one  of  the  original  20  subjects  using  a 
brief  click  as  a  measuring  stimulus.  The  aim  of  obtaining  HRTFs  from  KEMAR  was  to  assess  the  need  to 
base  3-D  stimulus  synthesis  on  individualized  (listener  specific)  HRTF  measurements.  If  acceptable 
measurements  could  be  obtained  from  KEMAR,  the  time-consuming  and  somewhat  risky  measurement 
procedures  involving  real  subjects  could  be  eliminated.  The  motivation  for  the  click  measurements  was  to 
evaluate  the  feasibility  of  making  HRTF  measurements  in  an  ordinary  room,  with  appropriate  gating  to 
remove  echoes. 

Analysis  of  the  HRTFs  revealed  large  inter-subject  differences,  substantial  differences  between  the 
KEMAR  HRTFs  and  those  from  any  of  the  human  subjects,  and  a  minimum  of  20  dB  loss  in  signal-to-noise 
(S/N)  ratio  accompanying  the  use  of  the  click  as  a  measuring  stimulus.  The  magnitude  component  of  the 
HRTFs  from  nearly  all  subjects  included  a  deep  notch,  usually  in  the  8-12  kHz  region,  that  was  dependent  on 
probe  microphone  position  and  independent  of  source  direction.  Further  acoustical  and  optical  measurements 
confirmed  that  this  notch  was  a  result  of  standing  waves  in  the  ear  canal.  A  principal  components  analysis  of 
the HRTFs was conducted with the aim of assessing the feasibility of constructing "model" HRTFs that would
have  the  important  features  of  real  HRTFs.  Unfortunately,  available  principal  components  algorithms  do  not 
accept  complex  data,  so  only  the  magnitude  components  of  the  HRTFs  were  analyzed.  The  analysis  revealed 
that  90%  of  the  variance  in  the  HRTFs  could  be  accounted  for  by  5  principal  components.  The  first  of  these 
confirmed  the  overall  similarity  of  the  HRTFs  across  subjects  in  the  low  frequencies,  and  the  next  two 
revealed  large  differences  across  both  subjects  and  positions  in  the  important  5-15  kHz  region. 

Extensive  psychophysical  tests,  using  techniques  developed  and  tested  previously  (Wightman  and 
Kistler, 1989b), were conducted on 15 adult listeners. In these tests, stimuli were presented from 36 positions
either  in  free-field  (anechoic  chamber)  or  in  simulated  free-field  (over  headphones).  Listeners  gave 
numerical  judgements  of  apparent  azimuth  and  elevation  of  the  sources.  The  results  suggested:  1)  when 
simulated  free-field  stimuli  are  synthesized  from  HRTF  measurements  obtained  from  the  listeners'  own 
ears,  the  apparent  positions  of  the  stimuli  are  the  same  as  in  free-field;  2)  the  elevation  components  of  the 
apparent  position  judgements  of  simulated  free-field  stimuli  are  very  sensitive  to  distortions  (in  the  5-10  kHz 
region)  of  the  HRTFs  used  to  synthesize  the  stimuli  (such  as  occur  from  use  of  HRTFs  from  other  listeners  or 
from  KEMAR).  The  tentative  conclusions  of  the  psychophysical  tests  were:  1)  at  the  present  time,  the  most 
veridical  simulations  of  three-dimensional  auditory  space  require  synthesis  to  be  based  on  a  listener's  own 
HRTFs;  2)  because  of  the  sensitivity  of  the  apparent  elevation  of  simulated  sources,  great  care  must  be  taken  to 
preserve  HRTF  information  in  the  5-10  kHz  region;  3)  only  if  techniques  can  be  developed  which  offer  much 
higher  S/N  ratio  than  a  single  click  will  it  be  possible  to  obtain  the  necessary  high-frequency  detail  in  the 
HRTF measurements while making the measurements in an ordinary room.

1.0 INTRODUCTION


A  three-dimensional  (3-D)  auditory  display  has  been  identified  as  one  of  the 
virtual  technologies  associated  with  the  Air  Force  "Super  Cockpit"  project.  In 
addition  to  a  panoramic  visual  display,  the  "Super  Cockpit"  will  provide  the  pilot, 
via  headphones,  information  from  aircraft  avionics,  weapons,  and  navigation 
systems  in  a  manner  which  optimizes  the  use  of  his  spatial  and  psychomotor 
capabilities.  The  auditory  display  subsystem  will  allow  the  pilot  to  hear  threats, 
targets  and  other  operators  as  if  they  originated  from  specific  locations  in  3-D 
space.  For  example,  verbal  instructions  from  an  electronic  co-pilot  will  appear  to 
originate  from  behind  the  pilot's  head.  These  signals  will  be  directionally 
accurate  and  stabilized  in  space  regardless  of  the  pilot's  head  position. 

There  have  been  very  few  extensively  documented  (i.e.,  with  psychophysical 
data)  demonstrations  that  3-D  auditory  space  can  be  successfully  simulated  with 
headphone-presented signals (see, for example, Wightman and Kistler, 1989b).
However,  it  is  generally  agreed  that  veridical  spatial  simulation  requires 
preprocessing  of  the  signal,  prior  to  headphone  delivery,  so  as  to  mimic  the 
acoustic  effects  of  the  head,  shoulders,  and  outer  ears.  Such  preprocessing  is 
typically  implemented  in  the  form  of  a  digital  filter  (one  for  each  ear),  the  transfer 
function  of  which  consists,  in  part,  of  an  estimate  of  the  acoustic  free-field-to-ear- 
canal  transfer  function,  or  "head-related  transfer  function"  (HRTF)  as  it  is  often 
called.  Obtaining  estimates  of  these  transfer  functions,  in  order  to  implement  the 
digital  filters  required  for  spatial  simulation,  presents  several  significant 
problems.  Finding  solutions  to  these  problems  is  the  aim  of  this  work. 

The first problem arises because measuring HRTFs is technically
demanding and subject to numerous errors (Mehrgardt and Mellert, 1977;
Wightman and Kistler, 1989a). The degree to which these various sources of
error contaminate the HRTF measurements in a perceptually significant way is
not clear. For example, the measuring microphones are very small and thus
inherently noisy. In addition, positioning the microphone in a stable way in the ear
canal is difficult, and since it must be close to the eardrum, there is some risk to
the subject. To reduce extraneous noise and echoes, the measurements should be
made  in  a  soundproof  or  ideally  anechoic  room.  Another  major  problem  arises 
from  the  fact  that  the  extent  of  inter-individual  differences  in  HRTFs  across 
numerous  subjects  is  not  well-known.  The  possibility  of  potentially  large 
differences  (Wightman  and  Kistler,  1989a)  suggests  that  the  digital  filters  may 
have  to  be  individual-specific  for  the  simulations  to  be  veridical.  If  true,  this 
would  complicate  spatial  simulation  procedures  enormously,  since  the  HRTFs  of 
each  potential  listener  would  have  to  be  separately  measured.  Unless  techniques 
could  be  developed  to  make  such  measurements  in  the  field,  to  use  a  standard  set 
of  HRTF  measurements  for  all  listeners,  or  to  model  the  HRTFs  mathematically, 
simulation  of  auditory  space  via  headphones  could  remain  a  laboratory  curiosity. 

Solution  to  the  problems  outlined  above  must  come  from  research  on  both 
the  engineering  and  psychophysical  aspects  of  the  issues.  While  an  engineering 
approach  can,  for  example,  reveal  the  optimal  technique  for  modeling  HRTFs, 
only a psychophysical experiment can reveal the perceptual significance of
differences between the model HRTFs and the real HRTFs. Therefore, a
combined  approach  was  used  for  the  work  conducted  under  this  effort.  In 
parallel  with  developing  new  measurement  techniques,  psychophysical 
experiments  were  used  to  evaluate  the  perceptual  consequences  of  various 
strategies  for  simplifying  the  measurements  of  HRTFs  and  thereby  achieving  a 
more  immediate  practical  application. 

The  specific  focus  of  the  work  was  on  the  need  for  individualized  HRTF 
measurements.  The  approach  involved  the  following  steps: 

1.  Measurement  of  HRTFs  (both  left  and  right  ear)  for  sound  sources  at  a 
large  number  (144)  of  positions,  from  a  large  number  (20)  of  subjects,  and  from  a 
standard  mannequin  (KEMAR),  using  well  understood  and  proven  techniques 
(Wightman and Kistler, 1989a).

2.  Analysis  of  the  measured  HRTFs  to  assess  inter-individual  variability  in 
HRTF  amplitude  and  phase  characteristics  in  various  frequency  regions,  and 
evaluation  of  analytic  techniques  (e.g.,  principal  components)  for  reducing  the 
HRTFs  to  weighted  sums  of  underlying  basis  functions. 

3. Psychophysical assessment, on a smaller number (10) of subjects, of the perceptual
adequacy  of  auditory  spatial  simulations  based  on  non-individualized  HRTFs, 
HRTFs  based  on  mannequin  measurements,  or,  if  the  analysis  is  successful, 
"canonical"  HRTFs  synthesized  on  the  basis  of  the  multivariate  analysis 
suggested  above. 

4.  Assessment  of  the  possible  perceptual  consequences  of  using  HRTFs 
measured  "in  the  field",  in  an  ordinary  room,  with  or  without  appropriate  gating 
to remove echoes.

2.0 MEASUREMENT OF HRTFs


Our  procedure  for  producing  signals  for  a  three-dimensional  (3-D)  auditory 
display  involved  digital  synthesis  of  stimuli  which  are  then  presented  over 
headphones.  The  basic  assumption  that  guides  this  approach  is  that,  if  the 
acoustical  waveforms  at  a  listener's  eardrums  are  the  same  under  headphones 
as  in  free-field,  the  listener's  perceptual  experience  will  also  be  the  same.  Thus, 
we  ignore  the  relevance  of  head  movements,  visual  cues  and  other  localization 
cues.  However,  the  promising  psychophysical  results  obtained  to  date  (see 
Psychophysical  Experiments  section  of  this  report,  and  Wightman  and  Kistler, 
1989b)  suggest  that,  for  a  limited  range  of  listening  conditions,  the  assumption  is 
valid.  The  stimulus  synthesis  technique,  the  central  feature  of  which  is  the 
measurement  of  free-field-to-eardrum  acoustical  transfer  functions,  is  described 
in  detail  elsewhere  (Wightman  and  Kistler,  1989a).  That  description  is  reprinted 
here  for  completeness  and  readability. 

Our approach is based on well-understood linear filtering
principles. Let $x_1(t)$ represent an electrical signal which drives a
loudspeaker in free-field, and let $y_1(t)$ represent the resultant
electrical signal from a probe microphone positioned at a listener's
eardrum. Similarly, let $x_2(t)$ represent an electrical signal which
drives a headphone, with $y_2(t)$ the resultant microphone response.
Given $x_1(t)$, our goal is to produce $x_2(t)$ such that $y_2(t)$ equals $y_1(t)$.
We do this by designing a linear filter which transforms $x_1(t)$ into the
desired $x_2(t)$.

The design of the appropriate filter is best described in the
frequency domain. Thus, $X_1(j\omega)$, or simply $X_1$, is the Fourier
transform of $x_1(t)$, $Y_1$ is the transform of $y_1(t)$, and so forth. The probe
microphone's response to $x_1(t)$ can be written:

$$Y_1 = X_1 \, L \, F \, M \tag{1}$$

where $L$ is the loudspeaker transfer function, $F$ the free-field to
eardrum transfer function (sometimes called the head-related
transfer function, or HRTF), and $M$ the microphone transfer
function. The probe microphone's response to $x_2(t)$ can be written:

$$Y_2 = X_2 \, H \, M \tag{2}$$

where $H$ represents the headphone to eardrum transfer function.
Setting $Y_1 = Y_2$ and solving for $X_2$ yields:

$$X_2 = X_1 \, \frac{L F}{H} \tag{3}$$

This equation shows that the desired filter transfer function $T$ is
given by:

$$T = \frac{L F}{H} \tag{4}$$

Thus, if the signal, $x_1(t)$, is passed through this filter, and the
resultant, $x_2(t)$, is transduced by the headphone, the signal recorded
by the probe microphone at the eardrum will be $y_1(t)$, the same signal
produced by the loudspeaker in free-field. This is represented in the
frequency domain by substituting the right side of Equation (3) for $X_2$
in Equation (2).
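As a check on the derivation, the substitution just described can be written out explicitly. This is only a restatement of Equations (1) through (4) with the same symbols; nothing new is assumed.

```latex
\begin{align*}
Y_2 &= X_2 \, H \, M \\
    &= \left( X_1 \, \frac{L F}{H} \right) H \, M \\
    &= X_1 \, L \, F \, M \;=\; Y_1 .
\end{align*}
```

The headphone response $H$ cancels, so the eardrum signal under headphones matches the free-field signal, including the loudspeaker term $L$.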

The  filter  described  in  (4)  applies  only  to  a  single  free-field 
loudspeaker  position  and  one  ear.  To  synthesize  each  stimulus, 
then,  we  must  design  a  pair  of  filters  (one  for  each  ear)  for  each 
desired  free-field  source  position. 

The  first  phase  of  our  synthesis  procedure  involves 
measurement  of  the  free-field-to-eardrum  transfer  function  (HRTF) 
for  each  ear  of  a  subject,  for  a  large  number  of  sound  source 
positions. In practice, what we actually measure is a quantity like
$Y_1$ in Equation (1) above, which includes not only the free-field-to-eardrum
characteristics ($F$), but also the characteristics of the test
signal ($X_1$), loudspeaker ($L$), and microphone ($M$). A headphone-to-eardrum
transfer function (like $Y_2$ in Equation (2) above) is also
measured  for  each  ear  of  the  same  subject.  In  the  second  phase  of 
the  synthesis,  each  desired  experimental  stimulus  is  digitally 
filtered.  The  transfer  functions  of  the  filters  (one  for  the  left  ear 
stimulus,  and  one  for  the  right)  are  defined  in  Equation  (4)  above. 
Ideally,  when  the  filtered  stimuli  are  presented  to  the  subject  over  the 
headphones,  the  waveforms  reaching  the  eardrums  should  be 
identical  to  those  produced  by  a  free-field  stimulus.  The  error  in  the 
procedure  is  quantified  by  recording  the  stimuli  at  the  eardrums  in 
the  free-field  and  headphone  conditions  and  computing  the 
difference. 

2.1  Transfer  Function  Measurement 

Both  free-field  and  headphone  transfer  function 
measurements  were  made  using  a  technique  loosely  based  on  the 
procedure  described  by  Mehrgardt  and  Mellert  (1977).  A  wide-band, 
noise-like  signal  was  presented  (either  by  loudspeaker  or  headphone) 
repetitively,  and  the  response  at  the  listener's  eardrum  was  obtained 
by  averaging  the  output  of  a  probe  microphone.  The  Fast  Fourier 
Transform  (FFT)  of  this  response  was  divided  by  the  Fourier 
transform  of  the  signal  to  produce  an  estimate  of  the  transfer 
function  in  question.  The  signal  was  20.48  msec  in  duration,  and 
was  computed  via  an  inverse  Discrete  Fourier  Transform  (DFT)  so 
that  both  the  amplitude  and  phase  components  of  its  spectrum  could 
be tailored to maximize the signal-to-noise ratio in the response
recordings. Specifically, the amplitude spectrum of the signal was
flat from 200 Hz to 4000 Hz, where it increased abruptly by 20 dB.
Thereafter,  it  was  flat  to  14  kHz.  The  signal  contained  no  energy 
below  200  Hz  or  above  14  kHz.  The  phase  spectrum  was  computed  to 
minimize  the  peak  factor  of  the  signal  (Schroeder,  1970).  The  signal 
was  output  continuously  (hence  with  a  repetition  frequency  of  about 
50 Hz), via a 16-bit digital-to-analog (D/A) converter (controlled by an
IBM-PC) at a rate of 50 kHz. No anti-aliasing filters were used. For
the  free-field  measurements,  the  signal  was  transduced  by  a 
miniature  loudspeaker  (Realistic  Minimus-7).  For  the  headphone 
measurements,  the  signal  was  transduced  by  a  pair  of  Sennheiser 
HD-340  headphones,  driven  in  phase.  Signals  were  presented  at 
approximately  70  dB  SPL,  a  level  chosen  to  reduce  the  contaminating 
effects  of  the  acoustic  reflex. 
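The construction of the test signal described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the period length of 1024 samples follows from 20.48 msec at 50 kHz, the phase rule is the commonly quoted form of Schroeder's (1970) formula and is an assumption here, the exact bin at which the 20 dB step falls is also assumed, and all function names are hypothetical. The sum of cosines is simply a real inverse DFT written out with the standard library.

```python
import math

FS = 50_000.0          # D/A rate in Hz (from the text)
N = 1024               # samples per period: 20.48 ms at 50 kHz
DF = FS / N            # harmonic spacing, about 48.8 Hz


def harmonic_amplitudes():
    """Relative amplitude per harmonic: flat from 200 Hz, +20 dB from 4 kHz up to 14 kHz."""
    amps = {}
    for k in range(1, N // 2):
        f = k * DF
        if 200.0 <= f < 4000.0:
            amps[k] = 1.0
        elif 4000.0 <= f <= 14000.0:
            amps[k] = 10.0     # +20 dB expressed as an amplitude ratio
    return amps


def schroeder_phases(amps):
    """Assumed Schroeder rule: phi_n = -2*pi * sum_{l<n} (n - l) * p_l, p_l = relative power."""
    total = sum(a * a for a in amps.values())
    phases, cum, acc = {}, 0.0, 0.0
    for k in sorted(amps):
        phases[k] = -2.0 * math.pi * acc
        cum += amps[k] ** 2 / total   # cumulative relative power up to this harmonic
        acc += cum                    # running sum of the cumulative power
    return phases


def synthesize(amps, phases):
    """One period of the signal as a sum of cosines (a real inverse DFT)."""
    return [
        sum(a * math.cos(2.0 * math.pi * k * n / N + phases[k]) for k, a in amps.items())
        for n in range(N)
    ]


def crest_factor(x):
    """Peak magnitude divided by RMS value."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return max(abs(v) for v in x) / rms


amps = harmonic_amplitudes()
signal = synthesize(amps, schroeder_phases(amps))
zero_phase = synthesize(amps, {k: 0.0 for k in amps})   # same spectrum, all phases zero
```

Comparing `crest_factor(signal)` with `crest_factor(zero_phase)` shows why the phase spectrum matters: both signals have the same amplitude spectrum, but the zero-phase version concentrates its energy in a single spike, which wastes converter and loudspeaker headroom.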

The  acoustical  response  at  the  eardrum  was  measured  with  a 
miniature  electret  microphone  (Etymotic)  coupled  to  a  silicone  rubber 
probe  tube  with  an  outer  diameter  of  less  than  1  mm  (see  Figure  1). 
This  probe  microphone  system,  with  its  matching  preamplifier  and 
compensation  network,  had  a  sensitivity  of  about  50  mV/Pascal,  and 
a  frequency  response  which  was  relatively  flat  (+/-  2.5  dB)  from  200 
Hz  to  14  kHz.  Two  matched  microphones  were  used,  one  for  each 
ear,  and  the  responses  from  both  were  measured  simultaneously. 
The  amplified  microphone  outputs  were  digitized  (simultaneously) 
using 16-bit analog-to-digital (A/D) converters (controlled by the IBM-PC)
at a 50 kHz sampling rate. The responses to 1000 periods of the
signal were averaged with floating-point precision, giving a spectral
resolution of 48.8 Hz and a worst-case signal-to-noise ratio of well
over 20 dB in the range 200 Hz - 14 kHz.
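The figures quoted above are tied together by simple arithmetic, sketched below. The sampling rate, period duration, and number of averages come from the text; the averaging gain assumes that the noise is uncorrelated from one period to the next, which is an assumption of this sketch and not a claim made in the source. Variable names are illustrative.

```python
import math

fs_hz = 50_000.0        # sampling rate (from the text)
period_s = 20.48e-3     # duration of one period of the test signal (from the text)
n_periods = 1000        # number of periods averaged per measurement (from the text)

samples_per_period = round(fs_hz * period_s)    # length of each averaged record
resolution_hz = fs_hz / samples_per_period      # spacing between spectral lines

# Assumption: noise is uncorrelated across periods, so coherent averaging
# improves the signal-to-noise ratio by 10*log10(n_periods) dB.
averaging_gain_db = 10.0 * math.log10(n_periods)

acquisition_s = n_periods * period_s            # recording time per measurement
```

Under these assumptions one measurement takes a little over 20 seconds of recording, so six elevations measured in succession take roughly two minutes.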

The  acoustical  measurements  were  made  with  the  tips  of  the 
probe-tubes  positioned  roughly  in  the  middle  of  the  subject's  ear 
canal,  about  1-2  mm  from  the  eardrum.  This  position  was  chosen  in 
order  to  be  certain  the  measurements  would  capture  all  direction- 
dependent  effects  (which  may  not  be  the  case  for  measurements  at 
the  ear-canal  entrance)  and  to  avoid  standing-wave  nulls  at  high 
frequencies.  At  14  kHz,  the  highest  frequency  of  interest  in  our  work, 
the  first  standing  wave  null  would  occur  at  about  6  mm  from  the 
eardrum  (assuming  the  ear  canal  is  a  uniform  tube  closed  at  one 
end).  To  avoid  occluding  the  ear  canals,  the  probe  tubes  were  held  in 
place  with  custom  (i.e.,  different  for  each  subject)  lucite  earmold 
shells,  trimmed  so  that  they  did  not  extend  into  the  concha  when 
inserted,  and  bored  out  to  a  thickness  of  less  than  0.5  mm.  With  the 
earmold shell in place, the probe tube was inserted into a thin, semi-rigid
guide tube which was cemented to the wall of the earmold shell
(see  Figure  2).  The  length  of  each  guide  tube  was  calibrated,  at  the 
time  the  earmold  assembly  was  made,  so  that  with  the  probe  inserted 
as  far  as  its  collar-stop  would  allow,  the  probe  tip  was  about  1  mm 
from the eardrum. This calibration was accomplished by inserting a
human hair into the guide tube until the subject indicated that the
hair  had  touched  the  eardrum.  The  hair  was  then  marked  and 
withdrawn  so  that  the  appropriate  length  for  the  guide  tube  could 
then  be  determined.  The  body  of  the  microphone  was  left  hanging  at 
the  side  of  the  subject's  ear. 
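The standing-wave figure quoted above (a first null about 6 mm from the eardrum at 14 kHz) follows from the quarter-wavelength relation for a uniform tube closed at one end. The sketch below is illustrative only: the speed of sound (343 m/s) is an assumed room-temperature value, and the function names are hypothetical.

```python
SPEED_OF_SOUND_M_S = 343.0   # assumed value for air at room temperature


def first_null_distance_mm(freq_hz, c=SPEED_OF_SOUND_M_S):
    """Distance from the closed end (eardrum) of the first pressure null: a quarter wavelength."""
    return 1000.0 * c / (4.0 * freq_hz)


def first_null_frequency_hz(distance_mm, c=SPEED_OF_SOUND_M_S):
    """Lowest frequency at which a probe at the given distance sits in a pressure null."""
    return c / (4.0 * distance_mm / 1000.0)


null_at_14k_mm = first_null_distance_mm(14_000.0)   # roughly 6 mm
null_freq_at_2mm = first_null_frequency_hz(2.0)     # far above the 14 kHz band limit
```

With the probe 1-2 mm from the eardrum, the first null falls well above 14 kHz under this idealized model, which is the rationale for the deep insertion.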

For  free-field  measurements,  the  periodic  wide-band  signal 
was  transduced  by  one  of  eight  loudspeakers,  each  positioned  1.38  m 
from  the  subject  in  an  anechoic  chamber.  The  loudspeakers  were 
mounted  on  a  semicircular  arc  (2.76  m  diameter),  the  ends  of  which 
were  attached  directly  above  and  directly  below  the  subject  (see 
Figure  3).  The  loudspeakers  were  aimed  at  the  position  of  the 
subject's  head  in  order  to  minimize  the  influence  of  loudspeaker 
directionality (which we found to be virtually non-existent within 10
degrees of the speaker axis). The entire arc assembly could be rotated
(by hand cranks) around the vertical axis and positioned with a
precision of about 0.5 degrees. The subject was seated on an
adjustable  stool  (with  back)  so  that  his/her  head  was  at  the  center  of 
the  arc.  The  speakers  were  mounted  at  -36,  -18,  0,  +18,  +36,  +54,  +72, 
and  +90  degrees  elevation  relative  to  the  horizontal  plane  passing 
through  the  subject's  ears.  The  measurements  were  made  at  all 
elevations  except  +72  and  +90  degrees,  and  at  all  azimuths  around 
the  circle  in  15  degree  steps.  Thus,  transfer  functions  were 
measured  from  both  ears  at  144  source  positions.  Figure  4  shows  a 
block  diagram  of  the  hardware  used  to  make  the  HRTF 
measurements;  the  same  set-up  (without  microphones)  was  used  to 
produce  the  stimuli  in  the  psychophysical  experiments. 
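The count of 144 source positions follows directly from the six elevations used and the 15 degree azimuth steps. A minimal sketch of the measurement grid is given below; the choice of 0 degrees as the first azimuth and the (azimuth, elevation) tuple layout are assumptions made for illustration.

```python
# Elevations actually measured: all speaker elevations except +72 and +90 degrees.
ELEVATIONS_DEG = list(range(-36, 55, 18))    # -36, -18, 0, +18, +36, +54

# Azimuths around the full circle in 15 degree steps (0 degree reference is assumed).
AZIMUTHS_DEG = list(range(0, 360, 15))

# Azimuth is the outer loop because the arc is set to one azimuth at a time.
positions = [(az, el) for az in AZIMUTHS_DEG for el in ELEVATIONS_DEG]
```

Each entry in `positions` corresponds to one pair of measurements, left and right ear recorded simultaneously.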

A  typical  measurement  session  lasted  about  an  hour.  After 
the  microphones  were  fitted  in  the  subject's  ear  canals,  the  subject 
was  seated  in  the  anechoic  chamber,  and  instructed  on  how  to  set  the 
azimuth of the loudspeaker arc using the hand-crank to turn
the  arc.  Then,  with  the  subject  alone  in  the  chamber,  the  arc  was 
moved  to  the  first  azimuth  setting  (usually  directly  behind  the 
subject).  Depending  on  the  condition  under  study,  the  subject  either 
looked  directly  forward  and  held  his/her  head  still,  or  bit  down  on  a 
bitebar,  which  could  be  attached  rigidly  to  the  subject's  seat.  After 
the  subject  signalled  the  experimenter  that  all  was  ready, 
measurements  were  made  in  rapid  succession  at  all  six  elevations, 
in  both  ears  simultaneously.  About  2  minutes  were  required  to  make 
the  six  pairs  of  measurements  at  each  azimuth.  The  subject  then 
moved  the  arc  to  the  next  location  and  the  sequence  was  repeated. 
Finally,  after  measurements  had  been  made  at  all  24  azimuths,  the 
subject  put  on  the  headphones,  taking  care  not  to  disturb  the  position 
of  the  microphones,  and  a  pair  of  transfer  function  measurements 
were taken with the headphones being used to transduce the wide-band
test signal.

2.2 Digital Filter Construction


Each  raw  data  record  consisted  of  the  time-domain 
representation  of  a  signal  recorded  from  a  probe  microphone  in  a 
subject's  ear  canal.  This  signal  included  not  only  the  direction- 
specific  characteristics  of  the  subject’s  outer  ear  (and  head, 
shoulders,  etc.),  but  also  the  characteristics  of  the  original  test 
signal,  the  loudspeaker  (or  headphones),  and  the  measuring 
microphone.  To  obtain  an  uncontaminated  free-field-to-eardrum 
transfer  function  characteristic  (HRTF)  or  an  uncontaminated 
headphone-to-eardrum  transfer  function,  the  effects  of  the  signal, 
loudspeaker  (or  headphone),  and  microphone  must  be  removed.  This 
could  be  done  by  transforming  the  raw  data  record  into  the  frequency 
domain  (via  a  FFT)  and  dividing  by  the  frequency  domain 
representation  of  the  characteristics  of  the  signal,  the  microphone, 
and  the  loudspeaker  or  headphone.  In  our  case,  to  produce  the 
digital  filters  required  for  stimulus  synthesis,  we  divided  the 
frequency  domain  representations  of  the  signals  recorded  in  free- 
field  by  the  frequency  domain  representations  of  the  same  signals 
recorded  under  headphones.  Since  the  stimulus  and  microphone 
characteristics  appear  in  both  the  numerator  and  denominator 
terms,  they  cancel.  The  loudspeaker  characteristics  were  not 
removed  from  the  digital  filters  used  to  synthesize  stimuli.  All 
digital  signal  processing,  including  test  stimulus  generation,  FFT 
computations,  digital  filter  design  and  implementation,  and 
waveform  analysis  was  accomplished  on  a  DEC  VAX-11/750 
computer  using  the  ILS  (Signal  Technology  Inc.)  software  package. 
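The division described above can be sketched as follows. This is a minimal illustration, not the ILS-based implementation used in the project: it uses a direct (slow) DFT from the standard library, treats each record as exactly one period so that circular filtering applies, and zeroes any bin where the headphone recording has no energy, which is an assumption of this sketch. All names are hypothetical. The synthetic check at the end builds a "free-field" record by circularly convolving a "headphone" record with a known response and confirms that the ratio recovers that response.

```python
import cmath


def dft(x):
    """Direct discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Inverse DFT, returning the real part (inputs here have real inverses)."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def synthesis_filter(y_free, y_head, floor=1e-12):
    """Per-bin ratio of free-field to headphone recordings; empty bins are zeroed (assumption)."""
    Yf, Yh = dft(y_free), dft(y_head)
    return [a / b if abs(b) > floor else 0j for a, b in zip(Yf, Yh)]


def apply_filter(T, stimulus):
    """Circularly filter one period of a stimulus with the transfer function T."""
    S = dft(stimulus)
    return idft([t * s for t, s in zip(T, S)])


def circular_convolve(x, h):
    """Circular convolution, used only to build the synthetic check below."""
    n = len(x)
    return [sum(x[m] * h[(t - m) % n] for m in range(n)) for t in range(n)]


# Synthetic check with made-up data: y_free = y_head (*) h_true, so T should equal DFT(h_true).
n = 32
y_head = [1.0, 0.5, 0.25] + [0.0] * (n - 3)
h_true = [0.0, 1.0, -0.3] + [0.0] * (n - 3)
y_free = circular_convolve(y_head, h_true)
T = synthesis_filter(y_free, y_head)
h_est = idft(T)
```

Because the same test signal and microphone appear in both recordings, they drop out of the ratio, while the loudspeaker response stays in the filter, as noted in the text.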

A  complete  set  (144  positions)  of  HRTF  measurements  was  obtained  from  20 
human  subjects  using  the  procedures  described  above.  A  representative  subset  of 
the  measurements  (all  six  source  elevations,  12  azimuths)  is  shown  for  two 
subjects  in  Figures  5-28.  In  these  figures,  the  magnitude  response  is  shown  on 
decibel  coordinates,  and  the  phase  response  (which  has  been  "unwrapped"  to 
avoid the usual ambiguities at +π and -π boundaries) on radian coordinates. A full
set  of  HRTF  measurements  was  also  obtained  from  KEMAR,  using  procedures 
that  differed  somewhat  from  those  used  with  human  subjects.  First,  the  KEMAR 
we  used  had  only  one  "ear"  (pinna,  canal  model  and  microphone).  Therefore,  an 
assumption  of  symmetry  around  the  vertical  median  plane  was  used  to  estimate 
HRTFs  from  the  other  ear.  Second,  KEMAR’s  own  microphone  (B&K  4134)  was 
used  to  measure  the  HRTF.  A  representative  sample  of  the  HRTFs  obtained  from 
KEMAR  is  shown  in  Figures  29-40. 

A  second  complete  set  of  HRTF  measurements  was  obtained  from  one  of  the 
20 original subjects using a brief acoustical impulse (a 20 µs unipolar square
pulse  transduced  by  the  loudspeakers)  presented  at  a  rate  of  about  50  per  second 
as  the  measuring  stimulus,  in  place  of  the  usual  periodic  pseudorandom  noise. 
All  other  aspects  of  the  measuring  procedure  were  the  same.  A  representative 
sample  of  the  measurements  made  with  the  click  stimulus  is  shown  in  Figures 
41-52.

3.0 ANALYSIS OF HRTFs


The  HRTF  data  described  in  the  previous  section  were  analyzed  in  several 
different  ways,  in  order  to  assess:  1)  the  variability,  from  subject  to  subject,  in  the 
amplitude  and  phase  components  of  the  HRTFs;  2)  the  sensitivity  of  the  HRTF 
measurements to the position of the probe microphone in the subject's ear canal;
3) the differences between HRTFs obtained using the pseudorandom noise and
HRTFs  obtained  using  the  click  stimulus;  and  4)  the  differences  between  HRTFs 
measured  from  human  subjects  and  HRTFs  measured  from  the  KEMAR 
mannequin.  In  addition,  a  subset  of  the  data  was  analyzed  with  two  procedures, 
critical  band  smoothing  and  principal  components,  in  an  effort  to  develop 
rigorous  procedures  for  representing  HRTFs  more  simply. 

3.1  Intersubject  Variability  in  the  HRTF 

It  has  been  known  for  over  20  years  (e.g.,  Shaw,  1965)  that  large  intersubject 
differences  exist  in  the  magnitude  components  of  the  HRTF.  Our  previous  work 
(Wightman and Kistler, 1989a) quantified these differences and showed further
that the differences are greatest in the 5-10 kHz region, and are not dependent on
source  position.  A  comparable  analysis  of  the  HRTF  data  gathered  in  this  project 
produced  similar  results.  Figures  53-56  show,  for  four  source  positions,  the  mean 
and  95%  confidence  intervals  of  the  smoothed  magnitude  of  the  HRTFs  from  our 
20  subjects.  The  magnitudes  were  smoothed  using  a  critical  bandwidth  of  0.50. 
As  reported  before  (Wightman  and  Kistler,  1989a),  the  intersubject  variability  in 
the  magnitude  response  is  greatest  in  the  5-12  kHz  region,  with  95%  confidence 
intervals  of  20  dB  or  more  not  uncommon. 

Our approach to quantification of the intersubject differences in the phase
components  of  the  HRTF  was  based  on  the  assumption  that  interaural  phase 
differences  would  be  most  meaningful  from  a  psychophysical  point  of  view. 
Therefore,  we  ignored  the  monaural  phase  component  of  the  HRTFs  and 
examined  interaural  phase  difference.  More  specifically,  we  examined 
intersubject differences in phase-derived interaural time difference, under the
assumption  that  time  difference  is  the  more  meaningful  quantity  from  a 
perceptual  standpoint.  The  interaural  time  difference  vs  frequency  functions 
from  these  subjects  are  virtually  identical  to  those  we  have  published  previously 
(Wightman  and  Kistler,  1989a),  and  the  intersubject  differences  are  nearly 
completely  determined  by  head  size. 
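The conversion from interaural phase difference to interaural time difference can be sketched as below. This is an illustration under stated assumptions: the phases must already be unwrapped, a pure delay of tau seconds is taken to contribute a phase of -2*pi*f*tau, and the sign convention (positive when the right-ear signal lags) is chosen here for convenience. The function name and the test values are hypothetical.

```python
import math


def interaural_time_difference_s(freq_hz, phase_left_rad, phase_right_rad):
    """Phase-derived ITD at one frequency; positive means the right ear lags (assumed convention)."""
    return (phase_left_rad - phase_right_rad) / (2.0 * math.pi * freq_hz)


# Made-up example: the right-ear signal is a copy of the left delayed by 500 microseconds.
delay_s = 500e-6
freqs_hz = [250.0, 500.0, 1000.0, 2000.0, 4000.0]
left_phase = [0.0 for _ in freqs_hz]
right_phase = [-2.0 * math.pi * f * delay_s for f in freqs_hz]   # unwrapped phase of a pure delay

itd_s = [interaural_time_difference_s(f, pl, pr)
         for f, pl, pr in zip(freqs_hz, left_phase, right_phase)]
```

For a pure delay the result is the same at every frequency; measured HRTFs give an ITD that varies with frequency, which is why the text refers to time difference vs frequency functions.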

3.2 Sensitivity of HRTF Measurements to Probe Microphone Position

Positioning a microphone probe tube in a subject's ear canal for HRTF
measurements  is  a  significant  problem.  The  probe  must  be  far  enough  down  the 
ear  canal,  past  the  entrance,  to  capture  all  directional  effects,  close  enough  to  the 
eardrum  to  avoid  contamination  of  the  HRTF  by  standing-wave  nulls,  but  not  so 
close  to  the  eardrum  as  to  risk  injury.  In  addition,  the  probe  must  not  move 
during  the  measurement  or  when  headphones  are  worn.  Our  procedure 
attempts  to  solve  the  stability  problem  with  the  use  of  a  custom  earmold  shell  that 
holds the probe in place (Wightman and Kistler, 1989a). The probe is positioned by
inserting it a calibrated distance down a guide tube that is attached to the earmold
shell. 


The  distance  of  the  probe  tip  from  the  eardrum  is  critical,  since  the 
standing  wave  null  in  the  measured  HRTF  response  appears  at  a  frequency 
inversely  proportional  to  that  distance  (e.g.,  the  null  is  at  8.5  kHz  if  the  probe  is  10 
mm  from  the  drum,  and  at  17  kHz  if  the  probe  is  5  mm  from  the  drum).  To  avoid 
the  standing  wave  null,  and  the  resulting  distortion  and  loss  of  HRTF 
information,  we  attempted  to  place  the  probe  2  mm  or  less  from  the  eardrum 
(Wightman  and  Kistler,  1989a).  The  method  we  used  relied  on  indirect  estimates 
of  probe-to-eardrum  distance,  relying  on  subjective  report  of  when  a  human  hair 
made  contact  with  the  eardrum.  During  the  course  of  this  project,  we  developed  a 
more  direct  method  for  estimating  the  distance  between  the  probe  and  the 
eardrum.  The  method  is  based  on  techniques  described  by  Chan  and  Geisler 
(1989).  First,  the  magnitude  components  of  the  HRTF  measurements  from  a 
large  number  of  source  positions  (144)  are  averaged,  thus  smoothing  out  the  large 
direction-dependent  spectral  features.  This  average  estimates  the  diffuse-field 
response  of  the  ear  (Shaw,  1980).  Figures  57-59  show  the  diffuse-field  response 
from  three  of  our  subjects.  Note  that  in  some  cases,  the  diffuse-field  response 
contains  an  obvious  notch  in  the  6-12  kHz  region.  This  notch  is  not  present  in 
Shaw's  (1980)  estimates  of  the  diffuse-field  response  of  the  human  ear,  and  is 
almost  certainly  a  reflection  of  the  standing-wave  null.  A  second  set  of  HRTF 
measurements  is  obtained  with  the  microphone  probe  withdrawn  a  few 
millimeters.  The  magnitudes  of  these  measurements  are  also  averaged.  The 
first  average  is  divided  by  the  second  average.  The  common  features  of  the 
diffuse-field  response  in  the  two  averages  cancel,  leaving  primarily  the  effects  of 
the  standing  wave  nulls  in  the  two  averages.  One  appears  as  a  notch  (from  the 
first  measurement,  with  the  probe  inserted  at  its  maximum  depth),  the  other 
(from  the  second  measurement)  appears  as  a  peak.  The  frequencies  of  the  notch 
and  peak  provide  estimates  of  the  probe-eardrum  distances  in  the  two  cases. 
Figures  60-62  show  the  results  of  dividing  the  two  diffuse-field  estimates  for  3  of 
our  subjects.  The  probe-eardrum  distance  estimates  derived  from  the  two  sets  of 
HRTF  measurements  are  shown  in  Table  1.  Note  that,  for  most  of  the  subjects, 
the probe-eardrum distance is considerably greater than the expected 1-2 mm.
This  is  most  likely  a  result  of  the  subjects  not  being  able  to  determine  when  the 
human  hair  used  as  a  depth  probe  actually  touched  the  eardrum  (Wightman  and 
Kistler,  1989a).  Table  1  also  includes  estimates  of  the  probe-eardrum  distance 
derived  by  combining  an  optical  measurement  of  ear  canal  length  with  an 
estimate  of  entrance  to  probe  tip  distance  taken  from  the  earmold  assembly.  Note 
that,  in  most  cases,  the  two  probe-eardrum  distance  estimates  are  in  good 
agreement. 
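
The distance estimate described above can be illustrated with a short sketch. It is not the software used in this project. It assumes that the standing-wave null falls at the quarter-wavelength frequency for the probe-to-eardrum distance, with a speed of sound of 340 m/s chosen because it reproduces the 8.5 kHz and 17 kHz examples given in the text. It also assumes magnitudes on a linear scale (for dB values the two averages would be subtracted rather than divided). All function names are hypothetical.

```python
ASSUMED_SPEED_OF_SOUND_M_S = 340.0  # assumption; reproduces the examples in the text


def null_frequency_hz(distance_mm, c=ASSUMED_SPEED_OF_SOUND_M_S):
    """Quarter-wavelength null frequency for a probe distance_mm from the eardrum."""
    return 1000.0 * c / (4.0 * distance_mm)


def probe_distance_mm(null_hz, c=ASSUMED_SPEED_OF_SOUND_M_S):
    """Invert the quarter-wavelength relation: null frequency to distance."""
    return 1000.0 * c / (4.0 * null_hz)


def mean_magnitude(spectra):
    """Average linear magnitude spectra (one list per source position)."""
    n = len(spectra)
    return [sum(column) / n for column in zip(*spectra)]


def notch_and_peak(freqs_hz, deep_avg, shallow_avg):
    """Divide the deep-insertion average by the withdrawn-probe average.

    The minimum of the ratio marks the null of the deep measurement and
    the maximum marks the null of the withdrawn measurement.
    """
    ratio = [d / s for d, s in zip(deep_avg, shallow_avg)]
    notch_hz = freqs_hz[ratio.index(min(ratio))]
    peak_hz = freqs_hz[ratio.index(max(ratio))]
    return notch_hz, peak_hz
```

Passing each of the two returned frequencies to probe_distance_mm gives the two distance estimates. In practice the ratio would probably need some smoothing before the extrema are located.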

We  conclude  from  the  experiments  described  above  that  HRTF 
measurements are very sensitive to probe position in the ear canal. The main
reason appears to be standing-wave patterns in the ear canal.
Unfortunately,  even  our  best  efforts  to  place  the  probe  tip  close  to  the  eardrum  (to 
avoid  the  standing  wave  problem)  did  not  completely  avoid  the  problems  in  all 
subjects.  Since  the  standing  wave  nulls  appear  in  the  5-12  kHz  region,  where 
important  localization  cues  are  believed  to  be  encoded,  it  is  possible  that  poor  probe 
placement for HRTF measurements could compromise the adequacy of 3-D
simulations  based  on  those  measurements.  The  extent  of  the  compromise  will  be 
evaluated  in  a  subsequent  section  of  this  report. 

3.3  Differences  Between  HRTFs  Measured  with  Pseudorandom  Noise  and 
HRTFs  Measured  Using  Clicks 

The  disadvantage  of  the  pseudorandom  noise  signal  used  to  measure 
HRTFs  is  that  since  it  is  periodic,  room  reflections  cannot  easily  be  removed,  and 
thus,  the  measurements  must  be  made  in  an  anechoic  room.  With  a  transient 
such  as  a  unipolar  click,  gating  out  reflections  is  a  rather  simple  matter.  It  is 
theoretically  possible,  then,  that  HRTFs  could  be  measured  in  an  ordinary  room  if 
a click signal could be used. However, a click has considerably less energy than
the  pseudorandom  noise  stimulus,  and  thus,  for  a  constant  peak  level,  the  S/N 
ratio  in  the  measurements  would  be  considerably  poorer  with  a  click  signal.  The 
peak  level  of  the  click  cannot  be  raised  to  compensate  for  the  loss  in  S/N  ratio, 
since,  at  high  levels  the  acoustic  reflex  contaminates  the  HRTF  measurements 
(Wightman  and  Kistler,  1989a). 

Figure  63  compares  HRTFs  obtained  with  the  usual  periodic 
pseudorandom  noise  (PRN)  signal  with  HRTFs  measured  using  a  click,  adjusted 
to  maximum  feasible  level.  Note  that,  while  the  click  HRTFs  have  the  same 
general  shape  as  the  PRN  HRTFs,  the  magnitude  is  at  least  20  dB  less.  At  high 
frequencies,  where  the  HRTF  is  normally  reduced  in  magnitude,  the  click  HRTF 
is  rather  different  from  the  PRN  HRTF,  probably  as  a  result  of  poor  S/N  ratio.  The 
perceptual  significance  of  these  differences  will  be  evaluated  in  a  subsequent 
section  of  this  report. 

3.4  Differences  Between  HRTFs  from  KEMAR  and  HRTFs  from  Humans 

Given  the  large  intersubject  differences  in  HRTF  described  above,  we  were 
led  to  expect  large  differences  between  HRTFs  obtained  from  KEMAR  and  those 
obtained  from  human  subjects.  Figures  53-56  summarize  those  differences  for 
four  source  positions.  Note  that  while  the  KEMAR  HRTFs  generally  fall  within 
the  range  of  HRTFs  obtained  from  humans,  there  are  consistent  differences,  the 
most striking of which is that the KEMAR HRTFs are higher in the high
frequency  regions.  The  effect  of  these  differences  on  the  adequacy  of  3-D  auditory 
simulations  produced  from  the  KEMAR  measurements  can  only  be  determined 
from  psychophysical  experiments.  The  results  of  an  experiment  in  which  human 
listeners  localized  simulated  sources  produced  from  KEMAR’s  HRTFs  are 
described  in  Section  4.4. 

3.5  Simplification  of  HRTFs 

The  acoustical  measurement  procedures  we  used  produce  a  very  detailed 
representation  of  the  HRTF.  Impulse  responses  are  represented  with  1024  time 
points,  and  spectra  with  512  complex  spectral  values.  Given  the  reduced  spectral 
resolving  power  of  the  human  auditory  system  at  high  frequencies,  it  seems  clear 
that  without  sacrificing  perceptually  relevant  detail,  HRTFs  could  be  represented 
with considerably less resolution. Use of simplified HRTFs is desirable since
computational  efficiency  would  be  increased. 

We  have  evaluated  two  different  approaches  to  the  problem  of  simplifying 
HRTFs.  One  involves  smoothing  the  spectral  representation  of  the  HRTF,  and  the 
other  attempts  to  model  the  HRTFs  with  a  set  of  underlying  basis  functions, 
determined  from  a  principal  components  analysis.  While  the  smoothing 
technique  has  proven  useful  for  display  purposes,  neither  smoothing  nor 
principal  components  analysis  has  yet  led  to  significant  reductions  in  the  extent  of 
computation  required  for  3-D  display  synthesis. 

Our  smoothing  algorithm  is  based  on  the  well-established  fact  that  human 
spectral  resolving  power  diminishes  with  increasing  frequency.  Spectral 
smoothing  can,  in  general,  be  viewed  as  the  convolution  of  the  unsmoothed 
spectrum  (magnitude  and  phase  separately)  with  a  filter.  The  extent  of 
smoothing  is  given  by  the  bandwidth  of  this  filter,  and,  to  some  extent,  its  shape. 
In  our  case,  the  filter  had  a  Gaussian  shape  and  its  bandwidth  was  set  equal  to 
the  average  human  "critical  bandwidth",  the  usual  measure  of  spectral  resolving 
power  (Sharf,  1970).  Since  critical  bandwidth  increases  with  frequency,  the 
resulting  smoothing  is  greater  at  high  frequencies.  Figure  64  shows  the  effect  of 
different  amounts  of  smoothing  (expressed  in  terms  of  fractions  of  the  normal 
human  critical  bandwidth)  on  the  HRTF  measurements. 
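
A minimal sketch of this kind of frequency-dependent smoothing follows. It is an illustration under stated assumptions, not the algorithm used to produce Figure 64. The critical bandwidth is approximated here with the Zwicker and Terhardt formula, which is our substitution for the Scharf (1970) values, and the stated fraction of the critical bandwidth is treated as the full width at half maximum of the Gaussian. The function names are hypothetical.

```python
import math


def critical_bandwidth_hz(f_hz):
    """Zwicker-Terhardt approximation of critical bandwidth (an assumption)."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (f_hz / 1000.0) ** 2) ** 0.69


def smooth_spectrum(freqs_hz, values, fraction=1.0):
    """Smooth values with a Gaussian whose width tracks critical bandwidth.

    fraction scales the bandwidth, so 0.5 smooths over half a critical
    band. The weights are normalized so that a flat spectrum stays flat.
    """
    smoothed = []
    for fc in freqs_hz:
        # Assumption: the bandwidth is the Gaussian's full width at half maximum.
        fwhm = fraction * critical_bandwidth_hz(fc)
        sigma = fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))
        weights = [math.exp(-0.5 * ((f - fc) / sigma) ** 2) for f in freqs_hz]
        total = sum(weights)
        smoothed.append(sum(w * v for w, v in zip(weights, values)) / total)
    return smoothed
```

Following the text, the same routine would be applied to the magnitude and to the phase separately. Because the bandwidth grows with frequency, a narrow spectral feature at 10 kHz is flattened far more than the same feature at 1 kHz.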

We  have  found  that  smoothing  the  HRTF  measurements  is  useful  for 
describing  certain  features  of  the  HRTFs  in  a  manner  that  is  reasonable  from  a 
psychophysical  point  of  view.  For  example,  earlier  in  this  report,  estimates  of  the 
between-subjects  variability  in  the  HRTF  were  presented  (Figures  53-56).  Had 
smoothing  not  been  used  to  remove  the  large,  narrow  peaks  and  troughs  in  the 
HRTFs  at  high  frequencies,  the  between-subjects  variability  in  this  frequency 
region  would  have  been  grossly  overstated.  While  it  is  possible  that  smoothing 
might  be  used  in  some  way  to  reduce  the  complexity  of  the  HRTFs  used  in  the  3-D 
synthesis  algorithms,  we  have  not  done  so  to  date. 

Our  second  attempt  at  simplifying  HRTFs  involved  a  principal  components 
analysis  (PCA).  The  central  idea  of  PCA  is  to  reduce  the  dimensionality  of  a  data 
set  in  which  there  are  a  large  number  of  interrelated  measures,  while  retaining 
as  much  as  possible  of  the  variation  present  in  the  data.  This  reduction  is 
accomplished  by  transforming  the  original  data  to  a  new  set  of  measures,  the 
principal  components,  which  are  uncorrelated.  These  components  are  extracted 
in  an  orderly  fashion  so  that  the  first  component  reflects  the  majority  of  common 
variation  and  the  remaining  components  reflect  decreasing  common  variation 
and  increasing  unique  variation.  Thus,  PCA  can  potentially  provide  a 
mechanism  for  describing  the  important  spectral  features  of  HRTFs  and  allow  an 
HRTF  for  a  given  spatial  location  to  be  derived  from  a  small  set  of  "basic" 
functions (i.e., the principal components).

The  major  deterrent  to  using  PCA  to  simplify  HRTFs  is  that  these  functions 
are  complex.  It  is  inappropriate  to  perform  traditional  PCA  on  magnitude 
functions and phase functions separately and then combine the magnitude and
phase principal components to reproduce the original HRTFs. Although
mathematical  algorithms  for  complex  PCA  have  been  derived,  the  techniques 
involve  sophisticated  mathematics  and  require  many  hours  of  computer  time  to 
perform.  Moreover,  the  computer  software  to  implement  complex  PCA  is  not 
widely  available.  Consequently,  it  is  difficult  to  evaluate  the  usefulness  of  this 
technique  since  it  has  been  used  so  infrequently.  Before  embarking  on  complex 
PCA,  we  decided  to  investigate  the  usefulness  of  PCA  for  data  reduction  and  for 
identifying  important  features  of  HRTFs,  by  first  analyzing  the  magnitude 
functions. 

Principal  components  analysis  was  performed  on  the  magnitude  estimates 
of  the  144  measurements  of  each  of  the  20  subjects.  Only  the  data  in  the  frequency 
region  from  200  Hz  to  15000  Hz  was  analyzed.  The  general  result  was  that 
approximately  90%  of  the  variation  in  the  144  magnitude  spectra  could  be 
accounted  for  by  5  principal  components.  That  is,  the  frequency  region  between 
200  and  15000  Hz  could  be  reduced  to  5  measures  with  very  little  loss  of 
information.  The  amount  of  variation  captured  by  5  components  did  vary 
somewhat across subjects, ranging from 86% to 92%. Additionally, there were
significant  differences  in  the  composition  of  the  components  across  subjects, 
although  there  were  also  some  similarities.  This  result  is  not  surprising  since 
we  have  observed  large  individual  differences  in  the  HRTFs  above  5  kHz,  and 
similarities  below  5  kHz.  The  first  three  principal  components  are  plotted  for  5 
representative  subjects  in  Figures  65-69.  The  remaining  two  components  tended 
to  account  for  smaller  amounts  of  variation  in  very  narrow  frequency  regions, 
and  thus  were  not  included  in  the  figures.  The  functions  plotted  in  these  figures 
reflect the contribution of each frequency to the component by its correlation with
the component. Correlations near 1.0 or -1.0 are indicative of a large amount of
variation accounted for by the component (r^2 is the amount of variance explained
by  a  component),  while  a  correlation  near  0  suggests  minimal  contribution  to  the 
component. 
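
The following sketch shows how the quantities discussed above (proportion of variation per component, and loadings expressed as correlations between each frequency and the component) can be obtained. It is a small illustration only, not the analysis software used here. It assumes a correlation-matrix PCA in which rows are source positions and columns are frequency bins, and it extracts components by power iteration with deflation so that no external library is needed. The function names are hypothetical.

```python
import math


def correlation_matrix(rows):
    """Correlation matrix of the columns of rows (columns must not be constant)."""
    n = len(rows)
    z = []
    for col in zip(*rows):
        mean = sum(col) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / n)
        z.append([(x - mean) / sd for x in col])
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]


def principal_components(rows, n_components, iterations=200):
    """Return a list of (proportion_of_variance, loadings) pairs.

    Each loading is the correlation between one variable and the
    component; its square is the variance explained for that variable.
    """
    r = correlation_matrix(rows)
    p = len(r)
    components = []
    for _component in range(n_components):
        v = [1.0 + 0.01 * i for i in range(p)]  # arbitrary start vector
        for _step in range(iterations):
            w = [sum(a * b for a, b in zip(row, v)) for row in r]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract
                break
            v = [x / norm for x in w]
        rv = [sum(a * b for a, b in zip(row, v)) for row in r]
        eigenvalue = max(sum(a * b for a, b in zip(v, rv)), 0.0)
        loadings = [math.sqrt(eigenvalue) * x for x in v]
        components.append((eigenvalue / p, loadings))
        # Deflate: remove the extracted component before the next pass.
        r = [[r[i][j] - eigenvalue * v[i] * v[j] for j in range(p)]
             for i in range(p)]
    return components
```

For data of the size analyzed here (144 spectra, several hundred frequency bins), an eigenvalue routine from a numerical library would be far faster. Power iteration is used only to keep the example self-contained, and it can behave poorly when two eigenvalues are equal.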

The  first  principal  component  is  comprised  of  high  correlations  in  the 
frequency  region  between  200  Hz  and  5000  Hz  for  all  subjects.  The  high 
correlations in this frequency region reflect the fact that the 144 HRTFs are highly
similar  in  the  low  frequency  region.  The  correlations  on  the  first  component  in 
the  region  between  5  kHz  and  15  kHz  are  somewhat  lower  and  more  variable, 
reflecting  the  greater  variation  in  the  region  with  sound  source  location. 
Components  2  and  3  account  for  most  of  the  remaining  variation  in  the  high 
frequencies. The pattern of correlations on Components 2 and 3 differs from
subject to subject and corroborates our previous accounts of interindividual
variability  in  the  high  frequency  region. 

In  summary,  principal  components  analysis  (PCA)  of  the  magnitude 
components  of  the  HRTFs  revealed  that  90%  of  the  variance  could  be  accounted  for 
by  5  principal  components.  The  first  of  these  confirmed  the  overall  similarity  of 
HRTF  magnitude  across  subjects  in  the  low  frequencies,  and  the  next  two 
revealed  large  differences  both  across  subjects  and  across  source  positions  in  the 
important 5 kHz-15 kHz region. We conclude that while principal components
analysis of HRTF data in the frequency domain is useful for describing the
common features of HRTF magnitude functions, it is not appropriate for
simplification  or  regeneration  of  HRTFs.  Either  the  time-domain  PCA  developed 
by  Molenaar  (1985),  or  the  complex  frequency  domain  PCA  described  by  Brillinger 
(1975)  may  be  applicable  to  this  problem,  but,  as  yet,  these  procedures  have  not 
been  thoroughly  evaluated. 

4.0 PSYCHOPHYSICAL EXPERIMENTS


This  project  involved  extensive  psychophysical  testing.  The  aim  of  the  tests 
was  to  evaluate  the  perceptual  adequacy  of  auditory  images  synthesized  from  the 
HRTF  measurements.  The  influence  of  various  distortions  of  the  HRTFs  on  the 
perceptual  adequacy  was  of  particular  interest.  More  specifically,  the 
psychophysical  tests  were  intended  to  address  the  following  questions: 

1)  What  is  the  ability  of  subjects  to  estimate  the  azimuth  and  elevation  of 
virtual  sources  produced  from  the  subjects'  own  HRTF  measurements,  and  how 
does  this  ability  compare  to  the  subjects’  ability  to  localize  real  sources  in  free- 
field? 


2)  How  is  a  subject's  ability  to  localize  virtual  sources  affected  if  the  sources 
are  produced  from  non-individualized  HRTFs  (e.g.,  from  HRTFs  measured  on 
another  subject  or  on  KEMAR)? 

3)  What  is  the  influence  of  alterations  in  the  magnitude  of  the  HRTF  above  5 
kHz  on  a  subject's  estimates  of  the  apparent  azimuth  and  elevation  of  virtual 
sources synthesized from the HRTF?

The  psychophysical  methods  were  identical  to  those  described  in  Wightman 
and  Kistler,  1989b.  For  completeness,  that  description  is  reprinted  here. 

4.1  Stimuli 

The basic stimulus in this experiment was a train of eight 250-ms
bursts of Gaussian noise (20 ms cosine-squared onset-offset
ramps),  with  300  ms  of  silence  between  the  bursts.  The  noise  bursts 
were  presented  at  an  overall  level  of  about  70  dB  SPL.  The  Gaussian 
noise was band-passed with a 10th-order digital Finite Impulse
Response  (FIR)  band-pass  filter  between  200  Hz  and  14  kHz.  The 
energy  spectrum  of  the  noise  was  shaped  (differently  for  each 
stimulus)  according  to  an  algorithm  which  divided  the  spectrum  into 
critical  bands,  and  assigned  a  random  intensity  (uniform 
distribution,  20  dB  range)  to  the  noise  within  each  critical  band.  This 
trial-by-trial  randomization  of  stimulus  spectrum  was  used  to 
prevent  listeners  from  becoming  familiar  with  specific  stimulus  or 
transducer  characteristics. 
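
The temporal structure of the stimulus can be sketched as follows. This is an illustration under assumptions, not the stimulus-generation software: it builds one Gaussian noise burst with cosine-squared ramps and repeats it, separated by silence. The band-pass filter and the random critical-band spectral shaping are omitted. The function name and its parameters are hypothetical.

```python
import math
import random


def burst_train(fs_hz, n_bursts=8, burst_ms=250, gap_ms=300, ramp_ms=20, seed=None):
    """Return a train of identical gated noise bursts as a list of samples."""
    rng = random.Random(seed)
    n_burst = round(fs_hz * burst_ms / 1000)
    n_gap = round(fs_hz * gap_ms / 1000)
    n_ramp = round(fs_hz * ramp_ms / 1000)

    burst = []
    for i in range(n_burst):
        edge = min(i, n_burst - 1 - i)  # samples from the nearer burst edge
        if edge < n_ramp:
            # Cosine-squared (raised-cosine) onset and offset ramps.
            gain = math.sin(0.5 * math.pi * edge / n_ramp) ** 2
        else:
            gain = 1.0
        burst.append(gain * rng.gauss(0.0, 1.0))

    train = []
    for b in range(n_bursts):
        train.extend(burst)  # the same burst is repeated
        if b < n_bursts - 1:
            train.extend([0.0] * n_gap)
    return train
```

With the default values the train lasts 8 x 250 ms + 7 x 300 ms = 4.1 s.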

The  noise  stimuli  were  presented  either  by  loudspeaker  or  by 
headphones.  In  the  former  condition,  the  stimulus  was  routed  to  one 
of  six  small  loudspeakers  (Realistic  Minimus-7).  The  loudspeakers 
were  chosen  to  have  similar  response  characteristics  (+/-  5  dB  from 
200  Hz  to  14  kHz),  so  no  attempt  was  made  to  compensate  for 
loudspeaker  differences  beyond  the  trial-by-trial  stimulus  spectral 
shaping  described  above.  The  loudspeakers  were  mounted  on  a 
semicircular  steel  arc,  2.76  m  in  diameter,  the  ends  of  which  were 
attached  to  bearings  directly  above  and  below  the  subject's  seat  in  an 
anechoic chamber. The subject was seated on an adjustable stool
such  that  his/her  head  was  at  the  center  of  the  arc  of  loudspeakers. 
The  arc  could  be  rotated  around  the  vertical  axis,  thus  allowing 
stimulus  presentation  at  any  azimuth,  and  at  any  one  of  six 
elevations.  The  loudspeakers  were  positioned  at  the  following 
elevations  relative  to  the  horizontal  plane  passing  through  the 
subject's  ears:  54  degrees,  36  degrees,  18  degrees,  0  degrees,  -18 
degrees,  and  -36  degrees. 

For  headphone  conditions,  the  noise  bursts  were  transduced 
by  Sennheiser  dynamic  headphones  (HD-340).  Each  headphone 
stimulus  was  digitally  processed  so  that  it  would  simulate  a  specific 
free-field  stimulus.  This  processing  compensated  for  the 
characteristics  of  the  headphones,  and  superimposed  a  given 
subject's direction-specific outer ear characteristics (HRTF) on the
stimulus  (Wightman  and  Kistler,  1989a).  Production  of  each 
stimulus  involved  passing  a  shaped  burst  of  Gaussian  noise, 
spectrally  contoured  according  to  the  algorithm  described  above, 
through two digital filters, one for the left-ear stimulus, and the other
for  the  right-ear  stimulus.  Each  digital  filter  consisted  of  two 
cascaded  sections.  The  first  was  the  filter  described  in  the 
companion  paper  (Equation  4  from  Wightman  and  Kistler,  1989a) 
which  includes  the  subject's  HRTF  for  a  given  ear  and  source 
position  and  the  inverse  of  the  subject's  headphone-to-ear-canal 
transfer  function  for  that  same  ear.  The  HRTF  and  headphone 
transfer  functions  were  measured  according  to  the  procedures 
described  in  the  companion  paper  (Wightman  and  Kistler,  1989a). 
The  second  section  was  a  zero-phase  band-pass  filter  (200  Hz  to  14 
kHz)  that  was  used  to  eliminate  processing  artifact  at  low  and  high 
frequencies.  Finally,  since  the  particular  D/A  system  used  to  output 
the stimuli (Ariel DSP-16) imposed a constant 10 µs delay between
left and right stimuli, a 10 µs time-shift was added to the phase
response  of  the  right  band-pass  filter  section  to  compensate  for  the 
delay.  Stimuli  were  filtered  in  the  frequency  domain,  using 
techniques based on the "overlap and add" FFT algorithm described
by  Stockham  (1966). 
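
For readers unfamiliar with overlap-and-add filtering, a minimal sketch follows. It is a generic illustration, not the IES implementation used here: the signal is cut into blocks, each block is zero-padded and multiplied by the filter's spectrum, and the overlapping tails of successive output blocks are summed. A small radix-2 FFT is included only to keep the example self-contained. The function names and the block length are hypothetical.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two (unscaled inverse)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def overlap_add(signal, impulse_response, block_len=64):
    """Filter signal with impulse_response by block-wise FFT convolution."""
    m = len(impulse_response)
    n_fft = 1
    while n_fft < block_len + m - 1:  # room for each block's full convolution
        n_fft *= 2
    h_spec = fft(list(impulse_response) + [0.0] * (n_fft - m))
    out = [0.0] * (len(signal) + m - 1)
    for start in range(0, len(signal), block_len):
        block = list(signal[start:start + block_len])
        x_spec = fft(block + [0.0] * (n_fft - len(block)))
        y = fft([a * b for a, b in zip(x_spec, h_spec)], inverse=True)
        # Add this block's output, tail included, into the running result.
        for i in range(len(block) + m - 1):
            out[start + i] += y[i].real / n_fft
    return out
```

In the synthesis described above, the impulse response would correspond to the cascaded filter for one ear, and the procedure would be run once per ear.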

Stimuli  for  a  given  subject  and  a  given  run  were  precomputed 
(using Signal Technology Inc.'s IES software on a DEC VAX-11/750)
and  stored  on  an  IBM-PC  disk.  They  were  then  converted  to  analog 
form  via  PC-controlled  16-bit  D/A  converters  at  a  50  kHz/channel 
rate.  No  antialiasing  filters  were  used,  since  the  nearest  aliased 
components  were  at  36  kHz,  well  beyond  the  range  of  hearing. 
Stimuli  were  presented  at  about  70  dB  SPL  in  free-field,  and  at 
approximately  the  same  level  under  headphones.  The  digital 
processing  of  the  headphone  stimuli  preserved  all  the  interaural 
level  and  time  differences,  and  the  slight  position-to-position  level 
differences (e.g., from front to back) that existed in free-field. Figure
4 shows a block diagram of the hardware used in the psychophysical
experiments. 

4.2  Procedure 

The  aim  of  this  experiment  was  to  compare  the  apparent 
positions  of  sounds  presented  in  free-field  and  under  headphones. 
Therefore,  we  felt  that  the  paradigm  used  to  quantify  apparent 
spatial  position  must  be  the  same  for  both  free-field  and  headphone 
listening.  After  considerable  pilot  work  in  which  we  compared  the 
strengths  and  weaknesses  of  a  number  of  techniques  (Wightman  and 
Kistler, 1980), we chose an "absolute judgement" technique. With
this  procedure,  a  subject  indicates  the  apparent  spatial  position  of  a 
sound  source  by  calling  out  numerical  estimates  of  apparent 
azimuth  and  elevation,  using  standard  spherical  coordinates.  (In 
our  previous  work  with  this  procedure,  we  also  asked  for  distance 
estimates.)  To  give  some  examples,  a  sound  heard  directly  in  front 
would  produce  a  response  "0,0",  a  sound  heard  on  the  right  and 
slightly elevated would produce "90, 10", a sound heard on the left and
below  the  horizontal  plane  would  produce  "-90,  -10",  and  a  sound  in 
the  rear  and  well  elevated  would  produce  "180,  60". 

We  were  initially  concerned  that  our  subjects  would 
demonstrate  a  wide  range  of  skill  with  the  absolute  judgement 
paradigm,  and  that  this  source  of  variance  would  contaminate  our 
results.  It  would  then  be  difficult  to  separate  individual  differences 
in  localization  ability  from  individual  differences  in  position 
estimation  skill.  However,  for  several  reasons,  we  proceeded 
anyway.  First,  our  main  interest  was  the  comparison  of 
performance  in  free-field  with  performance  under  headphones,  and 
both  would  be  measured  with  the  absolute  judgement  procedure. 
Second,  our  subjects  appeared  to  learn  the  procedure  very  quickly 
and  produced  very  stable  judgements.  Nevertheless,  all  subjects 
were  given  10  hours  of  experience  in  the  free-field  listening  condition 
before  final  data  were  collected. 

The  free-field  condition,  which  was  tested  first  for  10  subjects, 
required  subjects  to  estimate  the  apparent  position  of  sounds 
delivered  from  36  different  positions,  covering  a  360  degree  range  of 
azimuths  and  elevations  from  36  degrees  below  the  horizontal  plane 
to  54  degrees  above  it.  The  source  locations  were  chosen  from  a  list  of 
144  potential  positions,  which  were  those  at  which  each  subject's 
HRTFs  had  been  measured  (Wightman  and  Kistler,  1989a).  The 
choice  was  made  with  the  aim  of  sampling  the  possible  range  of 
azimuths  and  elevations  equally.  Later  in  the  experiment,  after 
subjects  had  completed  testing  in  both  free-field  and  headphone 
conditions,  a  second  set  of  36  positions  was  selected  and  7  of  the  8 
subjects  were  tested  again  in  both  free-field  and  headphone 
conditions.  Table  2  gives  the  coordinates  of  all  72  source  locations, 
and shows how they were divided into "low", "middle", and "high"
elevations, and "front", "side", and "back" azimuths for later
analysis. 

At  the  beginning  of  a  run  in  the  free-field  condition,  subjects 
were  blindfolded,  led  into  the  anechoic  chamber,  and  seated  at  the 
center  of  the  loudspeaker  arc  (no  subject  saw  the  inside  of  the 
anechoic  chamber  or  the  loudspeaker  arrangement  at  any  time 
during  free-field  testing).  The  subject  was  instructed  to  look  straight 
ahead  and  not  to  move  the  head  while  a  trial  was  in  progress.  The 
experimenter,  who  was  present  with  the  subject  in  the  chamber  in 
order  to  move  the  loudspeaker  arc  and  to  record  the  subject’s 
responses,  verified  head  position  and  stability.  Each  trial  began  with 
the  presentation  of  a  15  second  burst  of  white  Gaussian  noise  from  a 
loudspeaker (not one of those used for localization) mounted in front of
(or, in a separate condition, behind) the subject at floor level. The
purpose  of  this  noise  was  to  mask  the  sounds  made  by  moving  the 
loudspeaker  arc,  which  was  positioned  by  the  experimenter  during 
this  15  second  pretrial  period.  When  questioned  later,  all  subjects 
reported  that  they  could  not  detect  the  movement  of  the  loudspeaker 
arc.  After  the  masking  noise  terminated,  the  stimulus  was 
presented.  Recall  that  each  stimulus  consisted  of  eight  250  ms 
identical  bursts  of  spectrally-contoured  noise.  During  a  5  second 
silent  period  immediately  after  termination  of  the  stimulus,  the 
subject  called  out  azimuth  and  elevation  estimates,  and  the 
experimenter  entered  the  responses  on  a  data  sheet  (no  feedback  was 
given  to  the  subjects).  A  new  trial  began  with  the  experimenter 
repositioning  the  loudspeaker  arc  according  to  a  script  shown  on  the 
data  sheet.  The  experimenter  attempted  to  move  the  arc  for  about  the 
same  length  of  time,  regardless  of  the  required  azimuth.  During 
stimulus  presentation  the  experimenter  moved  to  a  corner  of  the 
chamber,  so  as  to  be  acoustically  unobtrusive.  On  a  given  run, 
subjects  heard  a  stimulus  from  each  of  the  36  locations  once;  the 
order  of  locations  presented  on  each  run  was  random.  Each  36-trial 
run  lasted  about  20  minutes,  and  breaks  of  about  5  minutes  were 
taken  after  each  run. 

The  procedure  for  the  headphone  condition  was  nearly 
identical  to  that  used  for  the  free-field  condition,  except  that  the 
subjects  heard  the  stimuli  over  headphones.  To  avoid  the  potential 
influence  of  visual  cues,  the  subjects  were  blindfolded  as  in  the  free- 
field  condition,  even  though  they  had  seen  the  inside  of  the  anechoic 
chamber  during  the  acoustical  measurement  phase  of  the 
experiment,  which  came  after  free-field  testing.  They  were  also 
seated  in  the  anechoic  chamber  during  headphone  testing.  The  trial 
sequence  was  the  same  as  for  the  free-field  condition,  except  that  no 
masking  noise  was  presented  before  each  trial.  After  each  stimulus 
was  presented,  and  the  subject  called  out  azimuth  and  elevation 
estimates,  the  experimenter,  who  was  outside  the  chamber  and 
listening over an intercom, entered the responses on a PC keyboard.
As before, each run required estimates of 36 source positions, and
because of the slightly faster pace, about 4 runs were completed in
each 90 minute session.

Fifteen  subjects  participated  in  the  psychophysical  testing  phase  of  this 
project.  The  subjects  were  young  adults,  6  male  and  9  female,  with  normal 
hearing  as  verified  by  audiometric  screening  at  15  dBHL.  None  of  the  subjects 
had  any  previous  experience  in  psychoacoustical  experiments,  and  all  were  naive 
regarding  the  purpose  of  the  project.  All  of  the  subjects  completed  the  free-field 
and  simulated  free-field  conditions;  only  5  of  the  subjects  participated  in  all 
experiments  described  below. 

Each  subject  completed  12  runs  in  the  free-field  condition  before  HRTF 
measurements  were  made.  Then,  8  runs  or  more  were  completed  in  each  of  the 
conditions involving virtual sources. Aside from testing the free-field condition
first,  and  the  simulated  free-field  condition  (simulations  were  based  on  each 
subject's  own  HRTF  data)  second,  no  attempt  was  made  to  present  the  various 
conditions  in  either  a  random  or  a  counterbalanced  order.  Our  previous  work 
(Wightman  and  Kistler,  1989b)  suggested  that  once  performance  had  stabilized 
(within  5-6  runs),  order  effects  would  contribute  little  to  the  data.  Analysis  of  the 
psychophysical data consisted of computing the judgement centroid (the
"average" apparent direction) for each stimulus in each condition, for each
subject  separately.  The  results  are  presented  in  the  form  of  scatterplots  of  actual 
(or  intended,  in  the  case  of  virtual  sources)  source  azimuth  and  elevation  vs. 
judged  source  azimuth  and  elevation  (the  latter  given  by  the  judgement  centroid). 
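
One way to compute such an "average" direction is sketched below. It is an illustration under assumptions, not a listing of the analysis software. It assumes that each azimuth-elevation judgement is converted to a unit vector, the vectors are summed, and the direction of the resultant is taken as the centroid. It uses the response convention described in Section 4.2 (azimuth 0 in front, positive to the right, 180 in the rear; elevation positive upward). The function names are hypothetical.

```python
import math


def to_unit_vector(az_deg, el_deg):
    """Convert a direction to (x forward, y right, z up) on the unit sphere."""
    az, el = math.radians(az_deg), math.radians(el_deg)
    return (math.cos(el) * math.cos(az),
            math.cos(el) * math.sin(az),
            math.sin(el))


def judgement_centroid(judgements):
    """Direction (azimuth, elevation) in degrees of the summed unit vectors."""
    sx = sy = sz = 0.0
    for az_deg, el_deg in judgements:
        x, y, z = to_unit_vector(az_deg, el_deg)
        sx += x
        sy += y
        sz += z
    azimuth = math.degrees(math.atan2(sy, sx))
    elevation = math.degrees(math.atan2(sz, math.hypot(sx, sy)))
    return azimuth, elevation
```

Averaging on the sphere avoids the wrap-around problem of averaging azimuths directly: judgements of 170 and -170 degrees yield a centroid in the rear rather than at 0 degrees in front.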

4.3  Free-Field  versus  Simulated  Free-Field 

Figures  70-84  show  the  results  from  the  conditions  in  which  15  subjects 
localized  real  sources  in  free-field,  and  then  localized  virtual  sources  presented 
over  headphones.  The  virtual  sources,  in  this  case,  were  synthesized  from  each 
subject's  own  HRTFs,  and  no  distortion  was  introduced  in  the  HRTFs.  Table  3 
summarizes  the  data  in  a  form  that  allows  assessment  of  both  individual  and 
average  performance  in  various  sectors  of  auditory  space.  Table  3  also  presents 
data on intersubject differences in performance.

One  important  result  that  can  be  seen  in  the  data  is  that  in  all  but  one  case 
(subject  SDE),  subjects  judge  the  apparent  azimuth  and  elevation  of  real  sources 
accurately.  Note  also  that  intersubject  differences  in  localization  performance 
appear  only  in  the  elevation  components  of  the  judgements;  all  subjects  judge 
apparent  source  azimuth  about  equally  well,  but  there  are  substantial  differences 
in  ability  to  judge  apparent  elevation. 

The  most  important  result  is  that  the  pattern  of  results  from  the  free-field 
condition  is  duplicated  with  virtual  sources  presented  over  headphones.  In  every 
case,  including  that  of  the  one  "poor"  subject,  SDE,  the  free-field  data  closely 
match  the  headphone  data.  Where  minor  discrepancies  appear  (e.g.,  subject 
SED),  they  are  only  in  the  elevation  component  of  the  judgements. 
These results clearly confirm the perceptual adequacy of our 3-D auditory
display  techniques  in  those  conditions  in  which  each  subject's  stimuli  are 
individually  tailored  through  the  use  of  the  subject's  own  HRTFs  in  the  synthesis. 
The  importance  of  individualized  stimulus  synthesis  will  be  explored  in  the  next 
series  of  experiments,  using  the  results  just  discussed  as  a  benchmark  for 
comparison. 
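
The synthesis step referred to here can be illustrated with a simplified sketch: a monaural signal is filtered with the left-ear and right-ear impulse responses that correspond to a measured HRTF pair. This is an illustration only, not the procedure used in the study. The function names are hypothetical, and any headphone equalization is omitted.

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two sample sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def synthesize_virtual_source(mono, hrir_left, hrir_right):
    """Filter one monaural signal with a pair of head-related impulse
    responses (assumed to be the time-domain form of the HRTFs for the
    intended source position) and return the (left, right) ear signals."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


# Toy impulse responses: the right ear receives the sound one sample later.
left, right = synthesize_virtual_source([1.0, 0.0, 0.0], [0.5, 0.25], [0.0, 0.5])
```

With individualized synthesis, the two impulse responses come from the listener's own ears. With non-individualized synthesis, they come from another listener or a mannequin.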

4.4  Use  of  Non-Individualized  HRTFs 

Two  conditions  were  studied  in  which  subjects  localized  virtual  sources 
synthesized  from  HRTFs  measured  from  other  than  their  own  ears.  In  one,  11  of 
the  original  subjects  localized  stimuli  synthesized  from  the  HRTFs  measured 
from  the  ears  of  our  "best"  subject.  In  the  other  condition,  7  subjects  (6  of  the  11 
mentioned above, plus the "best" subject) localized stimuli synthesized from
KEMAR's HRTFs.

Figures  85-95  show  the  data  obtained  from  the  first  condition.  The  results 
can  be  easily  summarized.  First,  the  use  of  another  subject's  HRTFs  for  stimulus 
synthesis  causes  no  more  than  minor  alterations  in  the  azimuth  components  of 
judgements  of  apparent  source  position.  Second,  with  regard  to  the  elevation 
components  of  the  judgements,  performance  is  never  as  good  as  with  stimuli 
synthesized  from  the  subject's  own  HRTFs.  In  other  words,  if  a  "good"  subject 
(i.e.,  a  subject  who  judges  apparent  elevation  accurately)  listens  to  stimuli 
synthesized  from  another  "good"  subject’s  HRTFs,  there  are  slight  degradations 
of  elevation  performance.  If  a  "poor"  subject  (e.g.,  SDE)  listens  to  stimuli 
synthesized  from  a  "good"  subject,  there  is  no  improvement.  Table  4  summarizes 
the  data  from  this  condition. 

Figures  96-102  show  data  from  the  second  condition,  in  which  seven 
subjects  localized  stimuli  synthesized  from  HRTF  measurements  made  on 
KEMAR. The results are similar to those from the first condition. First,
judgements of apparent source azimuth are relatively unaffected by the use of
KEMAR's HRTFs. However, there was a substantial increase in front/back
confusions for most subjects. Second, for about half of the subjects, apparent
source elevation was only slightly distorted, and for the other half it was badly
distorted. These distortions, which are manifested as lowered apparent elevation for
sources  at  positive  elevations,  occurred  with  sources  at  all  azimuths.  An  example 
of  this  effect  can  be  seen  in  Figure  98,  which  shows  that  for  all  sources  (regardless 
of  azimuth)  with  elevations  above  zero,  judged  elevation  is  close  to  zero.  It  is 
tempting  to  attribute  the  differences  between  the  subjects'  performance  with 
KEMAR-based  stimuli  to  HRTF  differences.  However,  we  have  been  unable  to 
trace  any  subject's  poor  (or  good)  performance  with  KEMAR-based  stimuli  to 
differences  (or  similarities)  between  that  subject's  own  HRTFs  and  KEMAR’s. 
Table  5  presents  the  summary  data  from  this  condition. 
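
The criterion used to count front/back confusions is not given in this section. One common way to flag them, sketched below as an assumption, is to mirror the judged direction about the plane through the two ears and check whether the mirrored judgement lies closer to the target than the original judgement does. All function names are hypothetical.

```python
import math


def angle_between(a, b):
    """Great-circle angle in degrees between two (azimuth, elevation) pairs."""
    az1, el1 = math.radians(a[0]), math.radians(a[1])
    az2, el2 = math.radians(b[0]), math.radians(b[1])
    dot = (math.sin(el1) * math.sin(el2)
           + math.cos(el1) * math.cos(el2) * math.cos(az1 - az2))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot))))


def mirror_front_back(direction):
    """Reflect a direction about the vertical plane through both ears."""
    az, el = direction
    mirrored = ((180.0 - az) + 180.0) % 360.0 - 180.0  # wrap to [-180, 180)
    return mirrored, el


def is_front_back_confusion(target, response):
    """True if the mirrored response is closer to the target than the response."""
    return (angle_between(target, mirror_front_back(response))
            < angle_between(target, response))
```

For a target at 30 degrees azimuth, a response at 150 degrees is flagged and a response at 40 degrees is not. Targets near 90 degrees azimuth lie close to the mirror plane, so the test is not informative there.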

4.5 Effects of High Frequency Distortions in the HRTF

Two  aspects  of  the  results  reported  here  and  elsewhere  (e.g.,  Wightman 
and  Kistler,  1989b)  argue  indirectly  for  the  importance  of  high  frequencies  to  the 
perception  of  source  elevation.  First,  analysis  of  the  HRTFs  reveals  that  between- 
subjects  differences  in  the  HRTFs  are  greatest  at  high  frequencies.  Second, 
results  from  the  experiments  with  nonindividualized  HRTFs  suggest  that 
accurate  elevation  perception  requires  that  a  subject's  own  HRTF  be  used  to 
synthesize  the  stimuli.  Use  of  a  different  subject's  HRTF  would,  presumably, 
distort  the  high  frequency  region  more  than  the  low  frequency  region. 

The  experiment  described  here  was  designed  to  assess  the  importance  of 
various  frequency  regions  directly.  Seven  of  the  original  15  subjects  localized 
virtual  sources  (synthesized  from  their  own  HRTFs)  in  which  the  energy  in 
various frequency regions had been removed by filtering. The filtering was
accomplished digitally, with a high-order FIR filter, so stop-band attenuation was
at least 80 dB, and the transition band was very narrow (filter skirts were very
steep).  Four  conditions  were  studied:  1)  5  kHz  low-pass  (all  energy  above  5  kHz 
removed);  2)  5  kHz  high-pass  (all  energy  below  5  kHz  removed);  3)  10  kHz  low- 
pass,  and  4)  10  kHz  high-pass.  Six  of  the  seven  subjects  were  tested  in  all  four 
conditions;  the  remaining  subject  was  tested  in  three  of  the  four  conditions.  The 
psychophysical  procedure  was  identical  to  that  used  in  the  other  experiments 
described  above.  Since  performance  was  comparable  across  all  subjects,  only  the 
data  from  one  subject  (who  completed  all  four  conditions)  will  be  shown.  Figures 
103-106  show  the  results  from  the  filtering  experiment  in  scatterplot  form,  and 
Tables  6-9  present  the  summary  data  for  conditions  1-4,  respectively. 

Note first that in all conditions except the 10 kHz high-pass condition,
apparent  source  azimuth  was  virtually  unaffected  by  filtering.  Apparent 
elevation,  however,  was  dramatically  affected  in  some  conditions.  Consider  the  5 
kHz  low-pass  condition  (Figure  103).  Here,  with  all  energy  above  5  kHz  removed, 
the  apparent  elevation  of  all  sources  is  close  to  zero  degrees.  This  strongly 
suggests  that  the  cues  for  source  elevation  are  encoded  in  frequencies  above  5  kHz. 
The  results  from  the  5  kHz  high-pass  condition  (Figure  104),  in  which  elevation 
perception  appears  normal,  confirm  the  importance  of  energy  above  5  kHz.  Next, 
consider  the  10  kHz  high-pass  condition  (Figure  105).  With  all  low-frequency 
information  removed,  elevation  perception  is  also  seriously  degraded.  The  results 
from  the  former  three  conditions,  combined  with  the  results  from  the  10  kHz  low- 
pass  condition  (Figure  106),  which  show  normal  elevation  perception,  indicate 
that  the  major  cues  to  source  elevation  lie  in  the  5-10  kHz  frequency  region. 
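
The report states only that a high-order FIR filter with at least 80 dB of stop-band attenuation was used. As an illustration of how such low-pass and high-pass filters might be designed, the sketch below uses the Kaiser-window method, which is an assumed design method and not one taken from the report. The 50 kHz sampling rate, the 500 Hz transition width, and the 90 dB design target (chosen to leave margin over 80 dB) are also assumptions.

```python
import cmath
import math


def bessel_i0(x, terms=50):
    """Zeroth-order modified Bessel function, summed from its power series."""
    total, term = 1.0, 1.0
    for k in range(1, terms):
        term *= (x / (2.0 * k)) ** 2
        total += term
    return total


def kaiser_lowpass(cutoff_hz, fs_hz, atten_db=90.0, transition_hz=500.0):
    """Odd-length windowed-sinc low-pass taps using Kaiser's design formulas."""
    beta = 0.1102 * (atten_db - 8.7)  # Kaiser's formula for atten_db > 50
    d_omega = 2.0 * math.pi * transition_hz / fs_hz
    n_taps = int(math.ceil((atten_db - 7.95) / (2.285 * d_omega))) + 1
    n_taps += 1 - n_taps % 2  # force an odd length (symmetric about a centre tap)
    mid = (n_taps - 1) // 2
    fc = cutoff_hz / fs_hz  # cutoff in cycles per sample
    taps = []
    for i in range(n_taps):
        m = i - mid
        ideal = 2.0 * fc if m == 0 else math.sin(2.0 * math.pi * fc * m) / (math.pi * m)
        window = bessel_i0(beta * math.sqrt(1.0 - (m / mid) ** 2)) / bessel_i0(beta)
        taps.append(ideal * window)
    return taps


def to_highpass(lowpass_taps):
    """Spectral inversion: a unit impulse at the centre tap minus the low-pass."""
    taps = [-t for t in lowpass_taps]
    taps[(len(taps) - 1) // 2] += 1.0
    return taps


def gain_db(taps, freq_hz, fs_hz):
    """Magnitude response of the filter at one frequency, in dB."""
    w = 2.0 * math.pi * freq_hz / fs_hz
    h = sum(t * cmath.exp(-1j * w * k) for k, t in enumerate(taps))
    return 20.0 * math.log10(max(abs(h), 1e-12))


FS_HZ = 50000.0  # assumed sampling rate
lowpass_5k = kaiser_lowpass(5000.0, FS_HZ)
highpass_5k = to_highpass(lowpass_5k)
```

Calling `gain_db(lowpass_5k, 6000.0, FS_HZ)` and `gain_db(highpass_5k, 4000.0, FS_HZ)` gives a quick check that energy on the removed side of the 5 kHz cutoff is strongly attenuated. The 10 kHz conditions would follow by changing the cutoff.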

5.0 IMPLICATIONS FOR PRACTICAL APPLICATION OF
3-D AUDITORY DISPLAY TECHNOLOGY


The psychophysical results lead us to conclude that for the most veridical 3-D
auditory display, a listener's own HRTFs should be used to synthesize the
virtual  sound  sources.  If  only  azimuth  information  is  to  be  conveyed  in  the 
display,  the  synthesis  requirements  can  be  relaxed  considerably.  However,  if  the 
apparent  elevation  of  a  source  is  important,  individualized  HRTFs  appear  to  be 
essential.  This  conclusion  is  quite  different  from  that  reached  by  Butler  and 
Belendiuk  (1977)  in  their  frequently-cited  paper  on  median-plane  localization.  In 
the  Butler  and  Belendiuk  study,  four  listeners  localized  noises  that  had  been 
recorded  from  microphones  placed  either  in  their  own  ears  or  in  the  ears  of  other 
listeners.  While  three  of  the  four  subjects  showed  no  effect  or  degradation  in 
localization  performance  with  non-individualized  stimuli,  one  of  the  subjects 
appeared to localize more proficiently with stimuli recorded from one of the other
subjects' ears than from his/her own. However, we feel this result must be
interpreted with great caution, for several reasons. First, the performance of the
one unusual subject was generally quite poor, and, in fact, was at chance in
free-field. Second, only one of the four subjects showed the effect. Third, the task
involved  not  localization,  but  identification  of  a  target  source  from  a  small  group 
(5)  of  sources,  arranged  only  on  the  median  plane.  The  generalizability  of  these 
results  to  localization  conditions  such  as  we  have  studied  seems  questionable.  It 
is  possible  (and  we  feel  quite  likely)  that,  over  time,  with  visual  and  kinesthetic 
feedback  (neither  of  which  was  available  in  our  study),  listeners  could  become 
quite proficient at localizing stimuli synthesized from non-individualized HRTF data
(Searle,  1982). 

The  data  also  suggest  that  the  importance  of  the  5-10  kHz  frequency  region 
to  elevation  perception  (revealed  best  by  the  filtering  experiments)  must  be 
recognized,  if  veridical  elevation  perception  is  expected.  Great  care  must  be  taken 
in  making  HRTF  measurements  and  in  synthesizing  stimuli  to  preserve  spectral 
information  in  this  region.  This  means  that:  1)  the  probe  microphone  used  to 
measure  the  HRTF  must  be  sufficiently  close  to  the  eardrum  to  avoid  a  standing 
wave null in the 5-10 kHz region; and 2) signal/noise ratio in the 5-10 kHz region
must  be  sufficiently  high  to  preserve  spectral  detail;  only  high-energy  stimuli 
(i.e.,  not  transients)  and  low-noise  environments  (i.e.,  sound-proofed,  anechoic 
rooms)  should  be  used  for  HRTF  measurements. 
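
The first requirement can be made concrete with a rough calculation. Under the idealized assumption that the eardrum acts as a rigid termination, the first standing-wave pressure null occurs one quarter wavelength from it, so a probe at distance d sees a null near f = c / (4d). The sketch below applies that simplified model. The function names, the use of millimetres, and the default speed of sound (343 m/s) are assumptions made for illustration.

```python
def quarter_wave_null_hz(distance_mm, speed_of_sound_m_s=343.0):
    """Frequency of the first standing-wave null seen by a probe located
    distance_mm from an (assumed rigid) eardrum."""
    if distance_mm <= 0:
        raise ValueError("distance must be positive")
    return speed_of_sound_m_s / (4.0 * distance_mm / 1000.0)


def max_distance_mm(null_above_hz, speed_of_sound_m_s=343.0):
    """Largest probe-to-eardrum distance that keeps the first null above
    the given frequency, under the same quarter-wavelength assumption."""
    if null_above_hz <= 0:
        raise ValueError("frequency must be positive")
    return 1000.0 * speed_of_sound_m_s / (4.0 * null_above_hz)
```

Under these assumptions, keeping the first null above 10 kHz requires the probe tip to sit within roughly 8.6 mm of the eardrum, and a probe about 17 mm away would place the null near 5 kHz. A real eardrum is not a rigid boundary, so these figures are only approximate.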

It  is  quite  possible  that  future  research  will  lead  to  stimulus  synthesis 
techniques  that  do  not  depend  on  individualized  HRTF  measurements.  For 
example,  it  may  be  the  case  that  mathematical  models  of  the  external  ear,  based 
on  a  small  number  of  anatomical  measurements,  could  serve  as  the  basis  for  the 
synthesis algorithms. However, at this time, it is our conclusion that only
carefully made individualized HRTF measurements can convey completely
accurate  source  position  information  for  a  3-D  auditory  display. 

Table 1

Estimates of Canal Length
and Distance from Eardrum

            Visual    Acoustic  Visual    Acoustic
            Canal     Canal     Eardrum   Eardrum
ID    Ear   Length    Length    Distance  Distance

SDE   L     20.5      23.6      4.5       7.6
      R     21.0      23.2      6.0       8.2
SDL   L     25.0      24.9      10.8      10.7
      R     23.0      23.9      8.5       9.4
SDO   L     25.0      25.6      6.5       7.1
      R     24.0      25.5      7.0       8.5
SDP   L     27.0      27.0      9.5       9.5
      R     26.5      28.9      6.5       8.9
SED   L     22.0      23.5      7.7       9.2
      R     22.0      23.4      6.8       8.2
SER   L     24.5      24.2      12.1      11.8
      R     20.5      24.2      7.5       11.2
SET   L     20.0      20.1      5.9       6.0
      R     19.0      19.7      4.9       5.6
SGB   L     24.5      26.2      10.2      11.9
      R     26.0      26.1      14.6      14.7
SGD   L     20.0      21.1      4.0       5.1
      R     22.0      21.8      5.0       4.8
SGE   L     19.5      22.0      3.9       6.4
      R     20.5      22.8      3.7       6.0
SGG   L     22.5      19.7      10.9      8.1
      R     20.0      20.5      8.8       9.3
SHD   L     19.5      <20.3     3.9       <4.7
      R     18.0      20.2      3.0       5.2
SHF   L     26.0      29.2      3.5       6.7
      R     25.0      28.1      2.5       5.6

Table 2

Source Positions

             Low            Middle            High
         Azim.  Elev.     Azim.  Elev.     Azim.  Elev.

Front      -15    -36       -15      0       -45     36
           -45    -36       -45      0         0     36
            30    -36         0      0        15     36
            45    -36        15      0        30     36
           -15    -18        45      0       -30     54
             0    -18       -15     18       -45     54
            30    -18       -30     18        15     54
            45    -18       -45     18
                              0     18
                             45     18

Side        90    -36       -75      0       -60     36
           105    -36       -90      0       120     36
           -60    -18      -105      0        90     36
           -90    -18        60      0      -105     54
            75    -18        90      0        75     54
           105    -18       -75     18        90     54
                            -90     18
                           -120     18
                             75     18
                            105     18

Back      -135    -36      -135      0      -135     36
          -150    -36       150      0      -150     36
           135    -36       165      0       165     36
           150    -36       180      0       180     36
           180    -36       135     18      -150     54
          -135    -18       150     18
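
Table 2 groups the source positions by azimuth region (Front, Side, Back) and by elevation band (Low, Middle, High). The sketch below shows one way to reproduce that grouping in code. The listed azimuths and elevations suggest the boundaries used here (Front up to 45 degrees, Side from 60 to 120 degrees, Back from 135 degrees; Low for negative elevations, Middle for 0 and 18 degrees, High for 36 and 54 degrees), but the exact cut points between the listed values are assumptions, and the function name is hypothetical.

```python
def classify_position(az_deg, el_deg):
    """Return (region, band) for a source position, following the grouping
    suggested by Table 2. Cut points between listed values are assumed."""
    magnitude = abs(az_deg)
    if magnitude <= 45:
        region = "Front"
    elif magnitude <= 120:
        region = "Side"
    else:
        region = "Back"

    if el_deg < 0:
        band = "Low"
    elif el_deg <= 18:
        band = "Middle"
    else:
        band = "High"
    return region, band
```

A grouping of this kind is what allows performance to be summarized separately for each sector of auditory space.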

Regional Measures of Performance with KEMAR's HRTFs

Figure 1. Photograph of one of the two ETYMOTIC microphones used
to measure HRTFs from inside subjects' ear canals. The thin
silicone probe tube is less than 1 mm in diameter


Figure  2.  Photograph  of  a  custom  lucite  earmold  assembly,  with  the 

probe  microphone  in  place,  that  was  used  in  subjects'  ear 
canals to measure HRTFs. Note that the earmold is trimmed
and bored out so that its acoustical effects would be minimal

Figure 3. Photograph of the inside of the anechoic chamber used both for
HRTF measurements and for psychophysical testing. During
psychophysical testing, the subject is blindfolded and the
loudspeaker arc is moved by an assistant. During
HRTF measurements, the subject moves the loudspeaker arc

Figure 4. Block diagram showing the major hardware components in
the set-up used both for HRTF measurements and for
psychophysical testing


[Figure residue between Figures 4 and 24: plots with axes labeled dB and
FREQUENCY; the plotted data and most captions did not survive extraction.
Legible panel titles: SGE LEFT EAR (AZIMUTH: -150); SGE RIGHT EAR
(AZIMUTH: -120); SGE RIGHT EAR (AZIMUTH: -30); SGE RIGHT EAR (AZIMUTH: 0);
SDP LEFT EAR (AZIMUTH: -150); SDP LEFT EAR (AZIMUTH: -60); SDP RIGHT EAR
(AZIMUTH: 60); SDP RIGHT EAR (AZIMUTH: 30); SDP LEFT EAR (AZIMUTH: 60).
Legible caption fragments: "Figure 11. Same as Figure 5, except for";
"Same as Figure 17, except for a"; "Figure 23".]


Figure  24.  Same  as  Figure  17,  except  for  a  source  at  60  degrees  azimuth 

Figure 34. Same as Figure 29, except for a source at 0 degrees azimuth

Figure 54. Same as Figure 53, except for 0 degrees azimuth and 0 degrees elevation

Figure 56. Same as Figure 53, except for -90 degrees azimuth and +54 degrees elevation

Figure 61. Same as Figure 60, except for su

Figure 63. Magnitude of HRTFs measured at 90 degrees azimuth and 0 degrees elevation using a
periodic pseudorandom noise and a click stimulus

Figure 65.

Figure 67. Same as Figure 65, except for Subject SER

Figure 68. Same as Figure 65, except for Subject SGE

Figure 69. Same as Figure 65, except for Subject SHD

Figure 71. Same as Figure 70, except for Subject SDH


Figure  72.  Same  as  Figure  70,  except  for  Subject  SDL 

Figure 76. Same as Figure 70, except for Subject SED


Figure  77.  Same  as  Figure  70,  except  for  Subject  SER 

Figure 81. Same as Figure 70, except for Subject SGE


Figure 83. Same as Figure 70, except for Subject SHD


Figure 84. Same as Figure 70, except for Subject SHF


The right panel shows data from a condition in which Subject SDE localized
36 stimuli synthesized from HRTFs measured from Subject SDO. The
format of the figure is the same as for Figure 70. The panel on the left is
included for comparison and shows SDE's performance with stimuli based
on SDE's own HRTFs

Figure 87. Same as Figure 85, except for Subject SDL

Figure 89. Same as Figure 85, except for Subject SED


Figure  91.  Same  as  Figure  85,  except  for  Subject  SET 


Figure 92. Same as Figure 85, except for Subject SGB


[Panel titles: SGE's Filters 20 dB rove (SGE); SDO's Filters 20 dB rove (SGE)]

Figure 93. Same as Figure 85, except for Subject SGE

Figure 95. Same as Figure 85, except for Subject SHD

Figure 97. Same as Figure 96, except for Subject SDO

Figure 99. Same as Figure 96, except for Subject SGB

Figure 103. The right panel shows data from a condition in which Subject SER localized
36 synthesized stimuli in which the energy above 5 kHz had been removed.
The format of the figure is the same as for Figure 70. The panel on the left
is included for comparison and shows SER's performance with unfiltered
stimuli

Figure 104. Same as Figure 103, except energy below 5 kHz was removed


Figure 105. Same as Figure 103, except energy below 10 kHz was removed

Figure 106. Same as Figure 103, except energy


